METHODS FOR ANALYZING POLYMER POPULATIONS

Field of the Invention

The present invention relates generally to methods for analyzing complex
mixtures of polymers. This facilitates, inter alia, the generation of accurate sequence
maps of analyzed polymers (e.g., nucleic acids).

Background of the Invention 

Analysis of polymers, including sequence analysis, often involves analysis of
polymer mixtures. These mixtures may contain multiple copies of identical polymers, or
they may contain multiple copies of disparate polymers (in terms of size and/or
sequence). In the former case, even though the sample is homogeneous with respect to
the polymer, the data generated is not directly useful because the polymers are usually
analyzed in an orientation-insensitive manner. As a result, each polymer is
independently analyzed in either a "head-first" or a "tail-first" orientation. Data sets
resulting from randomly analyzed individual polymers cannot be superimposed due to
the non-oriented nature of the data.

Additionally, polymer analysis usually requires analysis of more than one (and
often several hundred or several thousand) copies of the same polymer. This is
due to the inefficient labeling of single polymers and inefficient detection of probes that
are minimally labeled. Labeling efficiencies of 50% to 95% are common, particularly
when the labeling strategy involves labeling target sequence sites in nucleic acids with
nucleic acid probes. In addition, detection of single fluorophores at a high rate has an
average efficiency of 10-90% and is dependent upon the properties of the fluorophore
used as well as on the trajectory of the probe and polymer through the excitation spot of
the detection system.

Analysis of multiple copies of a polymer is therefore necessary in order to 
compile information for all target sequence sites in a polymer. 

Accordingly, when a sample contains more than one copy of a particular polymer
(or in more complex situations, more than one type of polymer), intensity profiles
generated from identical and uniformly oriented polymers are difficult to distinguish
from all other intensity profiles. Superimposition of intensity profiles from randomly
oriented polymers of one or more types is not particularly useful, and one is left
analyzing signals from individual polymers only.

Thus, there exists a need for discerning individual polymer data from a 
heterogeneous sample. An example of this is the need to accurately assess polymer 
orientation in order to generate polymer sequence maps. The ability to discern polymers 
from each other and determine polymer orientation should increase the amount of usable 
sequence data available and reduce the number of polymers that need to be analyzed. 
This is particularly useful if there is only a limited supply of the polymer (e.g., rarely
transcribed mRNA species). 



Summary of the Invention 

The invention provides methods and algorithms for processing polymer data. 
The method enables the identification of polymer-specific and orientation-specific data 
from a population data set. 

In one aspect, the invention provides a method for analyzing polymer intensity
data from a sample. The method comprises obtaining intensity profiles from individual
labeled polymers contained in the sample, aligning individual intensity profiles from
individual labeled polymers with respect to an alignment reference point, combining
aligned individual intensity profiles to generate a sample population profile, selecting a
peak in the sample population profile and obtaining individual intensity profiles that
contribute to the peak, combining individual intensity profiles that contribute to the peak to
generate a peak profile, and comparing the peak profile with the sample population
profile.

In one embodiment, the sample contains a heterogeneous mixture of polymers.
The heterogeneous mixture of polymers may comprise differentially sized fragments of a
parent polymer. The heterogeneous mixture of polymers may comprise polymers with
different sequences.

In one embodiment, the profiles are intensity versus length profiles. Length may
be contour length or actual length, depending on the embodiment.

The intensity data may be fluorescence intensity data and intensity profiles may
be fluorescence intensity profiles, but neither is so limited.

In one embodiment, the polymers are labeled with a sequence specific probe.
Additionally, the polymers may be labeled with a sequence non-specific label. In one
embodiment, the sequence non-specific label is a backbone label.

In important embodiments, the method is implemented on a computer.

In one embodiment, the polymer is a nucleic acid, such as DNA or RNA.
In one embodiment, the DNA is genomic nuclear DNA, mitochondrial DNA or cDNA.
In another embodiment, the RNA is mRNA.

In one embodiment, the alignment reference point is an internal reference point.
In another embodiment, the alignment reference point is a center of molecule reference
point. In yet another embodiment, the alignment reference point is a sequence specific
probe bound to individual polymers. In still another embodiment, the alignment
reference point is a sequence non-specific probe bound to individual polymers. In one
embodiment, the alignment reference point is a center of molecule reference point. In
another embodiment, the center of molecule reference point is the midpoint of an
individual profile.

The intensity profiles may be obtained from individual polymers in flow, or from
individual polymers fixed to a solid support. Alternatively, the intensity profiles may be 
obtained from individual polymers embedded in a gel matrix. 

In one embodiment, the sample population profile is a cumulative population
profile. In another embodiment, the sample population profile is an averaged population
profile. Similarly, the peak profile may be a cumulative peak profile or an averaged peak
profile.

In one embodiment, the peak is randomly selected. In another embodiment, the
peak is selected based on intensity. In yet another embodiment, the peak is selected
based on the presence of its mirror image peak in the population profile.

In some embodiments, the polymers in the sample are sorted according to size 
prior to aligning individual intensity profiles. 

In one embodiment, a peak profile that resembles the sample population profile
indicates a non-oriented profile.

In another embodiment, a peak profile that consists of a subset of peaks from the
population profile indicates a putative oriented profile. In one related embodiment, the
method further comprises inverting the putative oriented profile to generate a putative
inverted profile, combining the putative oriented profile with the putative inverted profile
to generate a putative non-oriented profile, and comparing the putative non-oriented
profile with the population profile, wherein a putative non-oriented profile that is
identical to the population profile indicates that the putative oriented profile is an
oriented profile, that the putative inverted profile is an inverted profile, and that the
putative non-oriented profile is a non-oriented profile.

In a second related embodiment, the method further comprises determining
whether individual peaks in the peak profile have corresponding mirror image peaks in
the population profile when the alignment reference point is a center of molecule
reference point. The presence of corresponding mirror images may indicate that the
putative oriented profile is an oriented profile.

In a third related embodiment, the method further comprises determining whether
the oriented peak has a corresponding mirror image peak in the population profile when
the alignment reference point is a center of molecule reference point. This latter method
may further comprise obtaining individual intensity profiles that contribute to the mirror
image peak, and combining individual intensity profiles that contribute to the mirror
image peak to generate a mirror image peak profile, and optionally comparing the mirror
image peak profile with the population profile, and optionally determining whether the
mirror image peak profile is a mirror image of the peak profile, and optionally inverting
and combining the mirror image peak profile with the peak profile provided the mirror
image peak profile is a mirror image of the peak profile.

In one embodiment, the mirror image peak profile is a cumulative mirror image 
peak profile. In another embodiment, the mirror image peak profile is an averaged 
mirror image peak profile. 

The method may further comprise inverting the oriented profile, combining the
oriented profile with the inverted profile to generate a non-oriented profile, and
comparing the non-oriented profile with the sample population profile.

The method may further comprise subtracting the peak profile from the sample
population profile, or subtracting the mirror image peak profile from the sample
population profile, or subtracting the peak profile and the mirror image peak profile from
the sample population profile.

The method may further comprise determining whether additional peaks remain
in the sample population profile following subtraction of the peak profile and the mirror
image peak profile. In a related embodiment, the presence of additional peaks is
indicative that the sample comprised different polymers.
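
The subtraction step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that every profile is a list of intensity values on a shared position grid; the function names and the `threshold` parameter are hypothetical and are not taken from the specification.

```python
from typing import List, Sequence


def subtract_profiles(population: Sequence[float],
                      *components: Sequence[float]) -> List[float]:
    """Subtract one or more profiles (e.g., peak and mirror image peak
    profiles) from the sample population profile, bin by bin."""
    residual = [float(v) for v in population]
    for component in components:
        if len(component) != len(residual):
            raise ValueError("profiles must share the same position grid")
        residual = [r - c for r, c in zip(residual, component)]
    return residual


def remaining_peaks(residual: Sequence[float], threshold: float) -> List[int]:
    """Return the bins whose residual intensity still exceeds the threshold."""
    return [i for i, v in enumerate(residual) if v > threshold]
```

Under these assumptions, a non-empty result from `remaining_peaks` after both the peak profile and the mirror image peak profile have been subtracted would suggest that the sample contained at least one further polymer type.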

In one embodiment, the peak is visible in an intensity versus length profile. In
another embodiment, the peak corresponds to bin counts.

In one embodiment, the polymer is completely stretched, while in another it is 
partially stretched. In yet another embodiment, the polymer is uniformly stretched.

Each of the limitations of the invention can encompass various embodiments of
the invention. It is therefore anticipated that each of the limitations of the invention
involving any one element or combinations of elements can be included in each aspect of
the invention. This invention is not limited in its application to the details of
construction and the arrangement of components set forth in the following description or
illustrated in the drawings. The invention is capable of other embodiments and of being
practiced or of being carried out in various ways. Also, the phraseology and terminology
used herein is for the purpose of description and should not be regarded as limiting. The
use of "including", "comprising", "having", "containing" or "involving" and variations
thereof herein, is meant to encompass the items listed thereafter and equivalents thereof
as well as additional items.


Brief Description of the Drawings 

The drawings are illustrative only and are not required for enablement of the 
invention disclosed herein. 

Fig. 1 shows the location of particular sequence sites ("target sequence sites") on
a polymer (top panel), a theoretical direct signal profile of the polymer based on these
sequence sites (middle panel), and a theoretical combination of the direct signal profile
and the mirror image signal profile showing the duplicate signals on either side of the
center of the molecule (bottom panel). This latter plot is used in Fig. 2. The middle
panel resembles and can represent a theoretical "individual intensity profile" and/or an
oriented profile. The bottom panel resembles and can represent a population profile
and/or a non-oriented profile. The arrows in the middle and bottom panels indicate the
orientation of the polymer which contributes to the corresponding peak. The profiles are
plotted as intensity (or photon count) as a function of length (i.e., position on the
polymer).

Fig. 2 shows signal intensity as a function of polymer position relative to the
center of the molecule for a single polymer type prior to orientation (top panel). The
observed and theoretical oriented sequence information is also shown (bottom panel).
The observed signals are indicated by a solid line and the theoretical signals are indicated
by a dashed line. The observed oriented profile shown in the bottom panel was derived
using the methods described herein.

Detailed Description of the Invention

The invention provides, inter alia, methods for evaluating and manipulating
polymer sequence data. These methods are used to align, orient and thus discern signal 
profiles that are derived from individual polymers in a sample. 

The need to discern polymer profiles derives in part from the fact that generally it
is impossible to label and detect all desired target sequence sites within a polymer with
100% efficiency (i.e., not every target site is labeled on every polymer). To compensate
for this, multiple copies of an identical polymer are usually analyzed and the resultant
signals are combined in order to observe and thus detect all sequence specific sites along
the polymer.
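
As a rough illustration of why multiple copies are needed, the following sketch estimates how many copies must be analyzed before a given target site is likely to be observed at least once. It assumes that each site is labeled and detected independently on each copy with a single combined probability; that independence assumption, and the function names, are illustrative only and are not part of the specification.

```python
import math


def site_coverage(p_detect: float, n_copies: int) -> float:
    """Probability that a given target site is observed on at least one
    of n_copies polymers, assuming independent events per copy."""
    return 1.0 - (1.0 - p_detect) ** n_copies


def copies_needed(p_detect: float, target_coverage: float) -> int:
    """Smallest number of copies for which site_coverage reaches the target."""
    if not (0.0 < p_detect < 1.0 and 0.0 < target_coverage < 1.0):
        raise ValueError("probabilities must lie strictly between 0 and 1")
    n = math.ceil(math.log(1.0 - target_coverage) / math.log(1.0 - p_detect))
    return max(1, n)
```

For example, with a labeling efficiency of 50% and a detection efficiency of 10% (a combined per-copy probability of 0.05), `copies_needed(0.05, 0.99)` indicates that on the order of 90 copies would be required for a single site under this simplified model.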

As used herein, analyzing a polymer means obtaining information about the
structure of the polymer such as its size, the order of its sequence sites, its relatedness to
other polymers, the identity of its sequence sites, or its presence or absence in a sample.
The structure of a polymer can reveal important information about its function since
these parameters are generally interrelated in biological polymers.

In some instances, the sample may contain multiple copies of the same polymer.
Such a sample is considered to be homogeneous. Polymers in homogeneous samples are
identical in length and sequence. Even a homogeneous sample however will give rise to
two types of profiles: a "direct" profile and an "inverted" profile. This is because most,
if not all, polymer analysis systems are orientation-insensitive. As a result, each polymer
has an equal chance of being analyzed in a "head-first" orientation (resulting in a "direct"
profile) or in a "tail-first" orientation (resulting in an "inverted" profile). When the
profiles from each polymer are combined, the resulting profile (referred to herein as a
"population profile" to distinguish it from an "individual polymer profile") contains
signals (or peaks) from the direct and the inverted profiles. Prior to the invention, it was
difficult to discern direct profile signals from inverted profile signals.

In other instances, the sample may contain multiple copies of a plurality of
polymers, wherein each of the plurality of polymers is different. As used herein,
different polymers are polymers that differ in length and/or sequence. Examples include
fragments of a larger polymer such as restriction fragments of a parent polymer or
sheared genomic DNA, and mRNA transcripts expressed in a cell or tissue. "Different
polymers" may however share some sequence identity, provided that they are not 100%
identical with respect to their sequence. In the case of heterogeneous samples, the
combined population profile is even more complex since it contains direct and inverted
profiles from more than one polymer type.

The invention provides methods for manipulating and processing the signals and
profiles from homogeneous and heterogeneous samples. In its simplest form, it provides
methods for discerning direct profiles from inverted profiles in a homogeneous sample.
It can also accomplish this for a given polymer in a heterogeneous sample. In a more
complex form, it discerns different polymers from each other as well as distinguishing
direct and inverted profiles for each polymer type.

The polymer being analyzed (sometimes referred to herein as the "target"
polymer) may be free flowing or it may be fixed to a solid support. In a fixed
conformation, the polymer is attached to a solid support at one or multiple attachment
points. The nature of the solid support is not limiting to the invention. The solid support
may be any surface to which the polymer can be attached without compromising its
integrity. Various types of solid supports are available (including microchips, beads and
the like), with which the art is familiar. When fixed to a solid support, the polymer is
immobile. In this latter embodiment, the interrogation and/or detection station of a
polymer analysis system may move relative to the polymer. In a flow conformation, the
polymer is able to move in a fluid, preferably through an interrogation station within the
polymer analysis system. The polymer may also be attached to a support that is itself
mobile, such as for example a free flowing bead.

Another immobilization approach involves the use of polymers trapped in a gel
matrix. Stretching of the polymer is accomplished through the use of an electric field,
for example.

In the absence of directional labeling (e.g., end specific labeling) of the polymer,
it is difficult to determine the direction in which the polymer is analyzed since the
polymer analysis system is orientation-insensitive. As a result, polymers are expected to
orient themselves randomly with approximately equal numbers being analyzed head-first
and tail-first, regardless of whether they are provided in free flowing or fixed
conformations.

The methods provided herein generally involve several data processing steps.
These include alignment of individual polymer profiles, compilation of individual
profiles to form population profiles, selection of individual signals (or peaks) from the
population profile, extraction of individual profiles that contribute to the selected signal
(or peak), compilation of these latter individual profiles to yield a "peak profile", and
comparison of the peak profile with the population profile. This latter comparison yields
information regarding the oriented nature of the subset of polymers giving rise to the
peak in the population profile. For example, this subset of polymers may itself comprise
direct and inverted polymer profiles. More preferably, this subset of polymers comprises
polymers oriented in one direction (e.g., all head-first or all tail-first). Each of these
steps will be discussed in greater detail below.

The polymers to be analyzed must be labeled in a sequence specific manner. It is
this labeling that gives rise to the signals (or peaks) which are later evaluated by the
methods of the invention. The polymer is generally labeled prior to analysis with the
polymer analysis system. Polymer labeling will be discussed in greater detail below.

Sequence specific labeling can be accomplished in any number of ways known in the art.
In important embodiments, the polymer is labeled using a binding partner that binds to
the polymer in a sequence specific manner. The most common example of a sequence
specific binding partner for nucleic acid polymers is a nucleic acid probe. As used
herein, a nucleic acid probe is a nucleic acid that hybridizes to the polymer being
analyzed (i.e., the target polymer) at a site that is complementary to its own sequence.
The terms "probe" and "tag" and "unit specific marker" are used interchangeably herein.
The nature of a nucleic acid probe will be described in greater detail herein. Briefly, it
can be of any length and of any sequence. The shorter the length, the greater the
resolution that may be achieved. Usually, nucleic acid probes should be contacted to the
target polymer under conditions that promote hybridization between true complements
(i.e., where each base of the probe is bound to its complementary base on the target
polymer in a continuous and contiguous manner). These conditions are referred to
herein as stringent conditions. The art is familiar with such conditions. (See for example
Maniatis et al., Molecular Cloning: A Laboratory Manual. Cold Spring Harbor (1982).)
The methods however are not limited to hybridization under stringent conditions and can
be performed under conditions in which less than 100% of the probe bases are bound to
target polymer bases.

The polymer may additionally be labeled in a sequence non-specific manner, as
will also be discussed in greater detail below. In some instances, the non-specific labels
are evenly distributed along the length of the polymer. Examples of non-specific
labels are stains that bind to the backbone of the target nucleic acid polymer. Preferably,
the sequence non-specific labels uniformly label the polymer along its length and thus do
not give rise to any intensity "peaks". Intensity peaks should derive solely from the
sequence specific labels described herein.

The labeled polymers are analyzed using a polymer analysis system. These
systems include interrogation and detection stations that serve to stimulate a signal from
a polymer (or a probe bound thereto) and to detect the resultant signal, respectively.
Preferably, the polymer analysis system is capable of analyzing single polymers. Even
more preferably, it analyzes the polymer linearly and is therefore referred to as a linear
single polymer analysis system. Such systems are discussed in greater detail below. An
exemplary polymer analysis system is the GeneEngine described in U.S. Patent No.
6,355,420 B1, issued March 12, 2002, the entire contents of which are incorporated by
reference herein.

The polymer analysis system analyzes individual polymers starting from one end
of the polymer and moving along the polymer length towards the opposite end. In the
process, signals are recorded as a function of their position or location on the polymer.
The sum total of signals for a given polymer is then plotted as a function of position on
the polymer. This plot is referred to herein as a profile. If the profile derives from
analysis of a single copy of a polymer, then it is referred to herein as an "individual



wo 2004/066185 



PCT/US2004/001823 




profile". If instead the profile derives from the combination (or compilation) of a
plurality of individual profiles, then it is referred to as a "population profile". As will be
discussed below, the population profile may be oriented or non-oriented. As used herein,
profiles are also referred to as "intensity profiles" since they reflect label intensity along
the length of the polymer. Labels will be discussed in greater detail below.

Once obtained, individual polymer profiles are aligned relative to each other in
order to facilitate their superimposition. Alignment is performed using an alignment
reference point. An alignment reference point is an identical site present in each
analyzed polymer of a given type. The alignment reference point may be internal to the
polymer (i.e., an internal alignment reference point) or it may be at an end of a polymer
(i.e., a terminal alignment reference point). It may be sequence dependent or sequence
independent, depending on the polymer. Furthermore, it may be intrinsically detectable
or it may be detected through the use of an extrinsic probe, for example. Accordingly,
the reference point may be visualized through the binding of a sequence specific probe or
a sequence non-specific probe to individual polymers.
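
One way to picture the alignment step is sketched below. The sketch assumes that each individual profile is stored as a list of intensity values in equal-width position bins, together with the index of the bin holding its alignment reference point; this data layout and the function name `align_profiles` are illustrative assumptions rather than the representation used by any particular polymer analysis system.

```python
from typing import List, Sequence, Tuple


def align_profiles(
    profiles: Sequence[Tuple[Sequence[float], int]]
) -> Tuple[List[List[float]], int]:
    """Zero-pad each profile so that all reference bins share one index.

    `profiles` holds (intensities, reference_bin) pairs. The function returns
    the padded profiles and the common index of the reference bin.
    """
    # Bins needed to the left and to the right of the reference point.
    left = max(ref for _, ref in profiles)
    right = max(len(values) - ref - 1 for values, ref in profiles)
    width = left + 1 + right

    aligned = []
    for values, ref in profiles:
        row = [0.0] * width
        start = left - ref  # shift so this profile's reference lands at `left`
        row[start:start + len(values)] = [float(v) for v in values]
        aligned.append(row)
    return aligned, left
```

After this step the profiles have equal length and can be superimposed bin by bin.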

As will be discussed in greater detail below, the method uses two reference 
points. One reference point is used to align individual profiles in order to generate a 
population profile (i.e., the alignment reference point) and the other reference point is 
used to determine orientation of individual profiles (i.e., the orientation reference point). 

The orientation reference point is preferably an internal reference point. More
preferably, it is the center of the molecule (or center of the polymer). The center of the
molecule can be determined by labeling the polymer uniformly along its length with for
example a length proportional dye or stain, estimating the length of the polymer based on
the length of the intensity profile and thereby determining the midpoint or center of the
molecule. The center of the molecule is a suitable reference point regardless of the
stretching characteristics of the target polymer (i.e., the center of the molecule may still
be determined even if the target polymer is not uniformly and completely stretched). For
example, it is possible that one or both ends of the polymer are compacted to an extent
that precludes linear polymer analysis in these regions. Regardless, if the polymer is
labeled with a length proportional label, these compacted areas are still useful for
determining the center of the molecule (i.e., the signal from these compacted areas is still
indicative of the length of polymer therein and can be used together with the more linear
portions of the polymer to determine the midpoint of the polymer).
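
The following sketch shows one possible way of locating the center of the molecule from a length proportional backbone signal. It assumes that the backbone intensity in each bin is proportional to the amount of polymer in that bin, so that the midpoint is the position at which half of the integrated backbone signal has been passed; this interpretation and the function name `center_of_molecule` are assumptions made for illustration.

```python
from typing import Sequence


def center_of_molecule(backbone: Sequence[float]) -> float:
    """Return the position (in bin units) that splits the integrated backbone
    signal into two equal halves. Bin i is taken to span [i, i + 1)."""
    total = sum(backbone)
    if total <= 0:
        raise ValueError("backbone signal must be positive")
    half = total / 2.0
    running = 0.0
    for i, value in enumerate(backbone):
        if value > 0 and running + value >= half:
            # Interpolate linearly inside the bin where the half point falls.
            return i + (half - running) / value
        running += value
    return float(len(backbone))  # only reached through rounding error
```

For a uniformly stretched polymer this coincides with the geometric midpoint of the profile. For a profile with a compacted end (more backbone signal per bin), the estimate shifts towards that end, which is consistent with the compacted region containing more polymer length than its measured extent suggests.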

The reference point may also be an origin of replication, a transcriptional
promoter, a centromere, a highly repetitive sequence, and the like.

The method preferably uses stretched linear polymers in order to maximize the
amount of sequence information that can be attained. Non-linear and/or coiled regions of
the target polymer are less useful for determining sequence. The polymer may be
uniformly stretched along its length, or it may contain regions that are more or less
stretched than other regions along its length. In either case, the polymer and/or regions
within the polymer may be maximally stretched. The polymer can also be less than
maximally stretched. Thus if maximum stretching is referred to as 100% stretched (see
below for definition of maximal stretching), then the polymer may also be at least 50%,
at least 60%, at least 70%, at least 80%, at least 90%, at least 95%, or at least 99%
stretched. In important embodiments, the polymer is uniformly but not maximally
stretched. As will be described in the Examples, polymers having stretched and
compacted regions are still useful.

Although the specification refers to polymer length (for example, on the x-axis of
Figs. 1 and 2), the invention similarly relates to polymer contour length Lc. Polymer
length as shown in the intensity vs. length plots (and thus profiles) represents the
polymer projection in the direction of flow or other stretching force. Actual polymer
length is the length of the polymer backbone or contour length (Lc) (i.e., the length per
nucleotide times the number of nucleotides, independent of polymer conformation). For
B-form DNA, the length per nucleotide is 0.34 nm. The measured length and contour
length are equal when the polymer is maximally stretched (i.e., 100% stretched). Thus,
the ratio of measured length to contour length is indicative of the extent of stretching of
the polymer.
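
The relationship between measured length, contour length and extent of stretching can be written out as a short sketch. The 0.34 nm rise for B-form DNA is taken from the text above; the function names are hypothetical.

```python
RISE_PER_NUCLEOTIDE_NM = 0.34  # B-form DNA, as stated in the text


def contour_length_nm(n_nucleotides: int,
                      rise_nm: float = RISE_PER_NUCLEOTIDE_NM) -> float:
    """Contour length Lc: length per nucleotide times number of nucleotides."""
    return n_nucleotides * rise_nm


def stretch_fraction(measured_length_nm: float, n_nucleotides: int,
                     rise_nm: float = RISE_PER_NUCLEOTIDE_NM) -> float:
    """Ratio of measured (projected) length to contour length.
    A value of 1.0 corresponds to a maximally (100%) stretched polymer."""
    return measured_length_nm / contour_length_nm(n_nucleotides, rise_nm)
```

As an illustration, a 1,000 nucleotide B-form polymer has a contour length of about 340 nm, so a measured length of 170 nm would correspond to a polymer that is about 50% stretched.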

Sequence non-specific labeling, such as intercalation, changes the contour length
by expanding the DNA. However, Lc can be defined even for such "swollen" DNA and
still be used to determine the extent of stretching. Overstretched DNA is essentially
denatured and should be avoided.

Once aligned, the individual profiles can be combined to yield a population
profile. "Combining" individual profiles as used herein means that the aligned and
possibly superimposed profiles are added together (i.e., intensity values at a given
position from one profile are added to corresponding intensity values at the same
position in another profile). The population profile may be a cumulative profile,
meaning that it represents the sum total of all intensity values as a function of position
along the polymer. Alternatively, it may be an averaged or normalized profile, meaning
that it represents the averaged or normalized intensity value as a function of position
along the polymer. The averaged or normalized profile is obtained by dividing the
intensity values on a cumulative profile by the number of contributing profiles.
Importantly, the population profiles may derive from individual profiles from identical
and/or different polymers, both of which will contribute direct and inverted profiles. As
used herein, a sample population profile is the profile that combines all individual
profiles obtained from a sample and thus should include signals from all labeled
polymers in the sample.
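
The combining step can be sketched as follows, assuming that the individual profiles have already been aligned onto a common grid of equal length. The function name `combine_profiles` and its `averaged` flag are hypothetical.

```python
from typing import List, Sequence


def combine_profiles(aligned: Sequence[Sequence[float]],
                     averaged: bool = False) -> List[float]:
    """Add aligned individual profiles bin by bin (cumulative profile).
    With averaged=True, divide by the number of contributing profiles."""
    if not aligned:
        raise ValueError("at least one profile is required")
    width = len(aligned[0])
    if any(len(profile) != width for profile in aligned):
        raise ValueError("profiles must be aligned to the same length")
    totals = [sum(column) for column in zip(*aligned)]
    if averaged:
        return [t / len(aligned) for t in totals]
    return totals
```

The same function can be applied to the full set of individual profiles to obtain a sample population profile, or to a subset of them to obtain a profile for that subset.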

The data from such analyses is generally combined in order to achieve higher
signal to noise ratios than would be possible by analyzing a single polymer.
Additionally, combining individual profiles yields the complete pattern of sequence
specific target sites on a polymer. Individual profiles may only provide signals for a
subset of target sites. Moreover, they may also include probes bound at incorrect sites
(i.e., mismatched probes). This is because binding of nucleic acid probes to a nucleic
acid polymer is generally less than 100% efficient and specific (e.g., hybridization
efficiency may range from 50% to 95% and hybridization specificity may range from 2
to 20). Hybridization specificity is the ratio of the proportion of correctly labeled target
sites to the proportion of incorrectly labeled sites. In addition, not every probe is
detected. For example, probes with one or few detectable labels on them are less likely
to be detected. Detection of single fluorophores at a high rate has an average efficiency
of 10-90% and depends upon the properties of the fluorophore used as well as on the trajectory
of the probe and polymer through the excitation spot of the polymer analysis system.

Population profiles generally contain twice as many signals (or peaks) as the
number of actual sequence specific sites on the target polymer. This is because at a
minimum the population profile is made up of direct and inverted individual profiles. If
sequence information is desired (e.g., in order to generate a sequence map), then it is
desirable to separate direct and inverted profiles from each other. Population profiles can
also be used as identifiers for a particular polymer and as such are referred to as barcodes
or fingerprints of the polymer. In preferred embodiments, the barcode or fingerprint is
an oriented population profile (i.e., a population profile in which all contributing
individual profiles are oriented in the same direction, either head-first or tail-first).
However, the population profile can also serve as an identifier even if in non-oriented
form, in some cases.

Once the population profile is formed, individual peaks in the profile are further
analyzed. It is desirable to select individual peaks that are formed from a subset of
oriented individual profiles. Such peaks may be selected randomly or based on a
particular parameter, such as intensity level. For example, in some cases, lower (but still
above background) intensity peaks are more likely to represent a subset of oriented
individual profiles. Once such a peak is identified, the individual profiles that
contributed to that peak are extracted from the data set. The extracted individual profiles
should all comprise a peak identical to the selected peak from the population profile.
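
Peak selection and extraction of the contributing individual profiles might be sketched as below. The sketch assumes aligned profiles on a common grid, treats a peak as a local maximum above a background threshold, and treats an individual profile as "contributing" when it has above-threshold signal within a small window around the selected bin. The threshold, the window (`tolerance`) and the function names are illustrative assumptions, not values or routines given in the specification.

```python
from typing import List, Sequence


def find_peaks(profile: Sequence[float], threshold: float) -> List[int]:
    """Indices of local maxima whose intensity exceeds the threshold."""
    peaks = []
    for i, value in enumerate(profile):
        left = profile[i - 1] if i > 0 else float("-inf")
        right = profile[i + 1] if i + 1 < len(profile) else float("-inf")
        if value > threshold and value >= left and value > right:
            peaks.append(i)
    return peaks


def contributing_profiles(aligned: Sequence[Sequence[float]], peak_bin: int,
                          threshold: float,
                          tolerance: int = 0) -> List[Sequence[float]]:
    """Individual profiles with above-threshold signal at the selected peak
    position, allowing +/- `tolerance` bins of positional error."""
    lo = max(0, peak_bin - tolerance)
    hi = peak_bin + tolerance + 1
    return [p for p in aligned if any(v > threshold for v in p[lo:hi])]
```

The extracted profiles can then be combined in the same way as the full data set to form the profile associated with the selected peak.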

Thus, in some instances, peaks that correspond to oriented profiles can be
identified as such if (a) the peak profile is asymmetric and has peaks at fewer positions
than the sample population profile, and (b) the combination of the direct and inverted
peak profiles is identical to the symmetric population profile. These criteria are valid for
homogeneous samples that include one polymer in two orientations.

In the case of a mixture of different polymers, the first criterion remains the same
(although in this case the profile need not be asymmetric), but the second is not
necessarily fulfilled. Once the mixture is separated into the profiles of different
polymers, each of those profiles can be analyzed as described above to extract an
oriented profile. However, in some cases, the profiles of different polymers may be
already extracted in oriented form. This will depend on the complexity of the sample
and profile, as well as the positioning and interference of individual peaks of different
polymers.

It is to be understood that as used herein, "identical peaks" mean two or more
peaks that are positioned identically along the length of the polymer (and
correspondingly, along the length of the profile). Identical peaks may vary however in
their intensity depending on whether the profile is an individual profile or a population
profile. Additionally, there will be variations in the intensities of identical peaks
between individual profiles.

Extracted individual profiles are then combined (as described herein) to yield a
peak profile. The peak profile is therefore a population profile since it is made up of a
plurality of individual profiles. The peak profile however is derived from only a subset
of individual profiles as compared to the sample population profile which represents all
profiles obtained from the sample. The peak profile is then compared to the sample
population profile. Depending upon the nature of the sample and the desired degree of
analysis, comparison of peak profiles to the sample population profile can take various
forms and iterations. It is to be understood that although the description provided herein
describes the comparison of a single peak profile with the sample population profile,
comparison of a plurality of peak profiles and potentially all peak profiles may also be
carried out by successive or concurrent iterations of the method. As will be apparent to
one of ordinary skill in the art, such data manipulations can be performed using a
computer.

If the peak profile resembles the sample population profile, this indicates that the
peak profile is likely derived from individual profiles in both orientations (i.e., it is a
non-oriented peak profile). As used herein, a peak profile that "resembles" a population
profile consists of peaks that are present in the population profile. As stated above,
identical peaks are peaks that are present at the same position along the polymer,
regardless of their intensity.

If however the peak profile consists of only a subset of the peaks present in the
sample population profile, then this suggests that the peak profile may derive from
oriented individual profiles. If necessary, this can be confirmed in a number of ways. In
an important embodiment, it is confirmed by inverting the peak profile, combining the
direct and inverted peak profiles to yield a non-oriented peak profile, and comparing the
non-oriented peak profile with the sample population profile. A non-oriented peak
profile that consists of peaks that are all present in the sample population profile
confirms the oriented nature of the originally selected peak profile and the individual
profiles and polymers giving rise thereto. If the non-oriented population profile is
identical to the sample population profile, this may further indicate that the sample is
homogeneous. When a non-oriented peak profile is identical to a population profile, the
two profiles consist of identically situated peaks.
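
The comparison logic described above can be illustrated with a short Python sketch. It
works on sets of peak positions (bin indices) rather than intensities, consistent with the
statement that identical peaks are defined by position only. The function names, the
returned labels, and the reading of "resembles" as "has the same set of peak positions" are
assumptions made for this example.

```python
def invert_positions(positions, n_bins):
    """Map each peak position to where it falls when the profile is reversed."""
    return {n_bins - 1 - p for p in positions}


def classify_peak_profile(peak_positions, population_positions, n_bins):
    """Classify a peak profile against the sample population profile."""
    if not peak_positions <= population_positions:
        return "inconsistent"  # contains peaks absent from the population
    if peak_positions == population_positions:
        return "non-oriented"  # same peaks as the whole population
    # Putative oriented profile: combine direct and inverted forms and compare.
    combined = peak_positions | invert_positions(peak_positions, n_bins)
    if combined == population_positions:
        return "oriented, sample appears homogeneous"
    if combined <= population_positions:
        return "oriented"
    return "unconfirmed"


population = {1, 3, 6, 8}  # hypothetical 10-bin population profile
print(classify_peak_profile({1, 6}, population, 10))
print(classify_peak_profile({1, 3, 6, 8}, population, 10))
print(classify_peak_profile({1, 6}, {1, 3, 4, 5, 6, 8}, 10))
```

The third call models a mixture: the combined direct and inverted peak profile accounts
for only some of the population peaks, so the orientation is confirmed but the sample is
not shown to be homogeneous.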

Orienting polymers according to the invention may be performed using a one or a
two step process, as described above. If a two step process is used, then peak profiles
that appear to be oriented are referred to as "putative" oriented peak profiles since their
oriented nature remains to be confirmed via the second step in the process.

Another method for confirming the oriented nature of a putative oriented peak
profile is to determine whether individual peaks present in the peak profile have
corresponding mirror images in the population profile when the alignment reference
point is a center of molecule reference point. As used herein, a mirror image peak is a
peak that exists distal to the center of molecule reference point and at a distance from the
center of molecule reference point identical to the distance between the center of the
molecule and the peak in question. For example, consider a peak that exists 20 microns
to the left of the center of molecule reference point. Its mirror image would exist 20
microns to the right of the center of molecule reference point.
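
A minimal sketch of the mirror image test follows. It assumes that peak positions are
expressed in microns relative to the center of molecule reference point (negative to the
left, positive to the right) and that two positions within a small tolerance count as
identical; the tolerance value and the function names are hypothetical.

```python
def mirror_position(position):
    """Reflect a position (microns from the center of the molecule) about the center."""
    return -position


def mirror_images_present(peak_positions, population_positions, tolerance=0.5):
    """Check that every peak has a mirror image peak in the population profile."""
    return all(
        any(abs(mirror_position(p) - q) <= tolerance for q in population_positions)
        for p in peak_positions
    )


# A peak 20 microns to the left of center has its mirror image 20 microns to the right.
print(mirror_position(-20.0))

population = [-20.0, -5.0, 5.0, 20.0]
print(mirror_images_present([-20.0, 5.0], population))
```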

Yet another method for confirming the oriented nature of a putative oriented peak
profile involves determining whether the putative oriented peak has a corresponding
mirror image peak in the sample population profile when the alignment reference point is
a center of molecule reference point. The mirror image peak can be processed similarly
to the originally selected peak. For example, the individual profiles contributing to the
mirror image peak can be extracted from the population data set, and thereafter
combined, as described herein, to generate a mirror image peak profile. The mirror
image peak profile can then be compared to the population profile in order to determine
whether the profiles resemble or are identical to each other. The mirror image peak
profile can also be compared to the inverted peak profile. If these latter profiles are
identical, then the peak profile is oriented.

As described herein, the method selects peaks present in an intensity versus
polymer length plot. This is intended to exemplify the analysis, particularly since the
Examples and corresponding Figures illustrate such peaks. However, it is likely that
individual peaks may not be as apparent experimentally, particularly when a sample of
hundreds, or thousands, or millions of polymers is being analyzed. Accordingly, the
method is not necessarily limited to the use of observable and discernible peaks. Rather
it can be performed using bin counts. As used herein, a bin is a period of time in which
the detection system collects signals from a polymer being analyzed. As an example, a
bin may be 1 microsecond in duration, and 1000 consecutive bins may contain
contiguous intensity data from one individual polymer. Each of the consecutive bins
therefore corresponds to a position along the length of the polymer. Thus rather than
using observable peaks, the method can be performed using bin counts (i.e., the number
of signals such as photon counts) for one or more bins. Accordingly, as used herein, the
term "peak" is meant to embrace observable and discernible increases in intensity on an
intensity versus length plot as well as bin counts in one or more bins. In some instances,
a peak may be defined by the signals (i.e., bin counts) falling into one, two, three, four,
five or more consecutive bins.
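
As an illustration of working with bin counts rather than visually apparent peaks, the
following Python sketch groups consecutive bins whose counts exceed a threshold into
peaks. The threshold, the example photon counts, and the function name are assumptions
made for this sketch; the specification does not prescribe a particular peak-calling rule.

```python
def find_peaks(bin_counts, threshold):
    """Return (first_bin, last_bin) for each run of consecutive bins above threshold."""
    peaks = []
    start = None
    for i, count in enumerate(bin_counts):
        if count > threshold:
            if start is None:
                start = i  # a new run begins at this bin
        elif start is not None:
            peaks.append((start, i - 1))  # the run ended at the previous bin
            start = None
    if start is not None:
        peaks.append((start, len(bin_counts) - 1))  # run extends to the last bin
    return peaks


# Hypothetical photon counts for ten consecutive bins from one polymer.
counts = [1, 0, 2, 9, 12, 1, 0, 0, 7, 1]
print(find_peaks(counts, threshold=3))
```

With these values the sketch reports one peak spanning two consecutive bins (3 and 4) and
one peak confined to a single bin (8).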

It is to be understood that the methods provided herein can be used to distinguish
polymers according to size. However, in some embodiments, it may be preferable to
distinguish polymers based on size prior to alignment. This can be done by sorting
polymers (and/or their corresponding data sets) according to intensity versus length
characteristics.

The invention provides for additional data processing. In one embodiment, it
may be desirable to remove signals deriving from an identified and oriented polymer
from a sample population profile in order to discern signals from different polymers. In
this way, the complexity of the sample population profile can be progressively reduced
and/or the complexity of a sample can be determined. As used herein, the complexity of
a sample refers to the number of different polymer types contained in the sample.
Accordingly, a sample that contains 100 different polymer types is more complex than a
sample that contains 2 different polymers, regardless of how many copies of each
polymer are present in the sample.

Individual profiles or a subset of individual profiles (such as for example an
oriented peak profile) may be subtracted from the sample population profile. Similarly,
the inverted peak profile may also be subtracted from the sample population profile in
order to effectively remove all signals from a given polymer. In this way, signals from a
given polymer are removed from the population profile, thereby making it less complex
and potentially allowing lower intensity peaks and/or profiles to be observed. As should
be apparent to one of ordinary skill, if subtraction of the oriented and inverted peak
profiles from the population profile results in a population profile devoid of peaks, then
this indicates that the population was homogeneous for one particular polymer. If, on the
other hand, there are additional peaks remaining after the subtraction, then this indicates
that more than one polymer is present in the sample.

It is to be understood that subtraction of profiles from each other can only be
accomplished when the profiles are of the same form (i.e., when both profiles are
cumulative profiles or when both profiles are normalized or averaged profiles).
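
The subtraction step can be sketched as follows, again only as an illustration. Both
profiles are taken to be cumulative profiles over the same bins, so that they are of the
same form. Clamping the residual at zero, the function names, and the example intensities
are assumptions of this sketch.

```python
def subtract(population, profile):
    """Subtract one cumulative profile from another, bin by bin, clamping at zero."""
    return [max(p - q, 0) for p, q in zip(population, profile)]


def residual_after_removal(population, oriented_peak_profile):
    """Remove both the oriented peak profile and its inverted form."""
    residual = subtract(population, oriented_peak_profile)
    return subtract(residual, oriented_peak_profile[::-1])


def remaining_peaks(profile, threshold=0):
    """List the bins that still carry signal above the threshold."""
    return [i for i, value in enumerate(profile) if value > threshold]


oriented = [0, 10, 0, 0, 0, 0, 8, 0, 0, 0]

# Hypothetical homogeneous sample: nothing is left after the subtraction.
homogeneous = [0, 10, 0, 8, 0, 0, 8, 0, 10, 0]
print(remaining_peaks(residual_after_removal(homogeneous, oriented)))

# Hypothetical mixture: a second polymer contributes an extra peak at bin 4.
mixture = [0, 10, 0, 8, 6, 0, 8, 0, 10, 0]
print(remaining_peaks(residual_after_removal(mixture, oriented)))
```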

A "polymer" as used herein is a compound having a linear backbone of
individual units which are linked together by linkages. In some cases, the backbone of
the polymer may be branched. Preferably the backbone is unbranched and linear. The
term "backbone" is given its usual meaning in the field of polymer chemistry. The
polymers may be heterogeneous in backbone composition thereby containing any
possible combination of polymer units linked together, such as peptide-nucleic acids
(which have amino acids linked to nucleic acids and have enhanced stability). In one
embodiment the polymers are, for example, nucleic acids, polypeptides, polysaccharides,
or carbohydrates. In the most preferred embodiments, the polymer is a nucleic acid or a
polypeptide. A polypeptide as used herein is a biopolymer comprised of linked amino
acids.

The polymer is made up of a plurality of individual units. An "individual unit" as
used herein is a building block or monomer which can be linked directly or indirectly to
other building blocks or monomers to form a polymer. The polymer preferably is a
polymer of at least two different linked units. The at least two different linked units may
produce or be labeled to produce different signals.

The polymer as well as the probes that bind the polymer can be nucleic acids.
The term "nucleic acid" is used herein to mean multiple nucleotides (i.e., molecules
comprising a sugar (e.g., ribose or deoxyribose) linked to an exchangeable organic base,
which is either a substituted pyrimidine (e.g., cytosine (C), thymidine (T) or uracil (U))
or a substituted purine (e.g., adenine (A) or guanine (G))). As used herein, the terms refer
to oligoribonucleotides as well as oligodeoxyribonucleotides.

Nucleic acids can be obtained from existing nucleic acid sources (e.g., genomic
or cDNA), or by synthetic means (e.g., produced by nucleic acid synthesis). Nucleic
acids can be but are not limited to DNA and RNA. In important embodiments, the
polymer being analyzed is a DNA or RNA. The DNA may be a genomic DNA such as
nuclear DNA or mitochondrial DNA. The DNA may also be cDNA. The RNA may be
mRNA or rRNA but is not so limited. Nucleic acid polymers to be analyzed may be
amplified in vitro prior to analysis in some embodiments, while in others the nucleic
acids are non-amplified in vitro.

Various modifications of nucleic acids are encompassed by the invention. 
Although not limiting, these usually apply to nucleic probes used to sequence the nucleic 
acid polymer. These modifications are described below. 

Nucleic acids shall also include polynucleosides (i.e., a polynucleotide minus a
phosphate) and any other organic base containing polymer. The nucleic acids can
include other non-naturally occurring substituted purines and pyrimidines such as C-5
propyne modified bases (Wagner et al., Nature Biotechnology 14:840-844, 1996).
Purines and pyrimidines include but are not limited to adenine, cytosine, guanine,
thymidine, 5-methylcytosine, 2-aminopurine, 2-amino-6-chloropurine,
2,6-diaminopurine, hypoxanthine, 2-thiouracil, pseudoisocytosine, and other naturally
and non-naturally occurring nucleobases, and substituted and unsubstituted aromatic
moieties. Other such modifications are known to those of skill in the art.

The nucleic acids may also encompass substitutions or modifications, such as in
the base and/or sugar moiety. For example, they include nucleic acids having backbone
sugars which are covalently attached to low molecular weight organic groups other than
a hydroxyl group at the 3' position and other than a phosphate group at the 5' position.
Thus, modified nucleic acids may include a 2'-O-alkylated ribose group. In addition,
modified nucleic acids may include sugars such as arabinose instead of ribose.

The nucleic acids may be heterogeneous in backbone composition thereby
containing any possible combination of polymer units linked together such as peptide
nucleic acids (which have an amino acid backbone with nucleic acid bases, and which are
discussed in greater detail herein). In some embodiments, the nucleic acids are
homogeneous in backbone composition.

As used herein with respect to linked units of a polymer, "linked" or "linkage"
means two entities are bound to one another by any physicochemical means. Any
linkage known to those of ordinary skill in the art, covalent or non-covalent, is embraced.
Natural linkages, which are those ordinarily found in nature connecting the individual
units of a particular polymer, are most common. Natural linkages include, for instance,
amide, ester and thioester linkages. The individual units of a polymer and/or probes may
be linked, however, by synthetic or modified linkages. Polymers in which units are
linked by covalent bonds will be most common but may also include hydrogen bonded
units, etc.

Intensity data may be obtained by analyzing polymers having probes bound
thereto. These probes are preferably sequence specific. "Sequence specific" when used
in the context of a nucleic acid means that the probe recognizes a particular linear
arrangement of nucleotides or derivatives thereof. An analogous definition applies to
non-nucleic acid polymers. In preferred embodiments, the linear arrangement includes
contiguous nucleotides or derivatives thereof that each bind to corresponding contiguous
complementary nucleotides on the target nucleic acid. In some embodiments, however,
the sequence may not be contiguous as there may be one, two, or more nucleotides that
do not have corresponding complementary residues in the target.

It is to be understood that any nucleic acid analog that is capable of recognizing a
nucleic acid with structural or sequence specificity can be used as a probe to label
sequence sites on a polymer or to identify a reference point. In most instances involving
a nucleic acid polymer, the probes will form at least a Watson-Crick bond with the
polymer. In other instances, the probe can form a Hoogsteen bond with the nucleic acid
polymer, thereby forming a triplex with the target nucleic acid polymer. A nucleic acid
sequence that binds by Hoogsteen binding enters the major groove of its target and
hybridizes with the bases located there. Examples of these Hoogsteen binding probes
include molecules that recognize and bind to the minor and major grooves of nucleic
acids (e.g., some forms of antibiotics). The probes may form both Watson-Crick and
Hoogsteen bonds with the polymer. BisPNA probes, for instance, are capable of both
Watson-Crick and Hoogsteen binding to a nucleic acid polymer. When used to identify
polymer sequence, it is preferred that the probes have strong sequence specificity.

The probe may be a peptide nucleic acid (PNA) and various forms thereof as
described herein, a locked nucleic acid (LNA), DNA, RNA, or co-polymers of the above
such as DNA-LNA co-polymers.

PNAs are DNA analogs having their phosphate backbone replaced with
2-aminoethyl glycine residues linked to nucleotide bases through glycine amino nitrogen
and methylenecarbonyl linkers. PNAs can bind to both DNA and RNA targets by
Watson-Crick base pairing, and in so doing form stronger hybrids than would be possible
with DNA or RNA based probes.

Peptide nucleic acids are synthesized from monomers connected by peptide
bonds (Nielsen, P.E. et al., Peptide Nucleic Acids, Protocols and Applications, Norfolk:
Horizon Scientific Press, p. 1-19 (1999)). They can be built with standard solid phase
peptide synthesis technology.

PNA chemistry and synthesis allow for inclusion of amino acids and polypeptide
sequences in the PNA design. For example, lysine residues can be used to introduce
positive charges in the PNA backbone. All chemical approaches available for the
modifications of amino acid side chains are directly applicable to PNAs.

PNA has a charge-neutral backbone, and this attribute leads to fast hybridization
rates of PNA to DNA (Nielsen, P.E. et al., Peptide Nucleic Acids, Protocols and
Applications, Norfolk: Horizon Scientific Press, p. 1-19 (1999)). The hybridization rate
can be further increased by introducing positive charges in the PNA structure, such as in
the PNA backbone or by addition of amino acids with positively charged side chains
(e.g., lysines). PNA can form a stable hybrid with a DNA molecule. The stability of such
a hybrid is essentially independent of the ionic strength of its environment (Orum, H. et
al., BioTechniques 19(3):472-480 (1995)), most probably due to the uncharged nature of
PNAs. This provides PNAs with the versatility of being used in vivo or in vitro.
However, the rate of hybridization of PNAs that include positive charges is dependent on
ionic strength, and thus is lower in the presence of salt.

Several types of PNA designs exist, and these include single strand PNA
(ssPNA), bisPNA, and pseudocomplementary PNA (pcPNA).

The structure of the PNA/DNA complex depends on the particular PNA and its
sequence. Single stranded PNA (ssPNA) binds to ssDNA preferably in antiparallel
orientation (i.e., with the N-terminus of the ssPNA aligned with the 3' terminus of the
ssDNA) and with a Watson-Crick pairing. PNA also can bind to DNA with a Hoogsteen
base pairing, and thereby forms triplexes with dsDNA (Wittung, P. et al., Biochemistry
36:7973 (1997)).

Single strand PNA is the simplest of the PNA molecules. This PNA form
interacts with nucleic acids to form a hybrid duplex via Watson-Crick base pairing. The
duplex has a different spatial structure and higher stability than dsDNA (Nielsen, P.E. et
al., Peptide Nucleic Acids, Protocols and Applications, Norfolk: Horizon Scientific
Press, p. 1-19 (1999)). However, when different concentration ratios are used and/or in
the presence of a complementary DNA strand, PNA/DNA/PNA or PNA/DNA/DNA
triplexes can also be formed (Wittung, P. et al., Biochemistry 36:7973 (1997)). The
formation of duplexes or triplexes additionally depends upon the sequence of the PNA.
Thymine-rich homopyrimidine ssPNA forms PNA/DNA/PNA triplexes with dsDNA
targets where one PNA strand is involved in Watson-Crick antiparallel pairing and the
other is involved in parallel Hoogsteen pairing. Cytosine-rich homopyrimidine ssPNA
preferably binds through Hoogsteen pairing to dsDNA forming a PNA/DNA/DNA
triplex. If the ssPNA sequence is mixed, it invades the dsDNA target, displaces the
DNA strand, and forms a Watson-Crick duplex. Polypurine ssPNA also forms a
PNA/DNA/PNA triplex with reversed Hoogsteen pairing.

BisPNA includes two strands connected with a flexible linker. One strand is
designed to hybridize with DNA by a classic Watson-Crick pairing, and the second is
designed to hybridize with a Hoogsteen pairing. The target sequence can be short (e.g.,
8 bp), but the bisPNA/DNA complex is still stable as it forms a hybrid with twice as
many (e.g., a 16 bp) base pairings overall. The bisPNA structure further increases
the specificity of binding. As an example, binding to an 8 bp site with a probe having a
single base mismatch results in a total of 14 bp rather than 16 bp.
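
The arithmetic in the preceding example can be written out as a short Python sketch. It
assumes that each matched position of the target site contributes one Watson-Crick and
one Hoogsteen pairing and that a mismatched position contributes neither; the function
name is hypothetical.

```python
def bispna_base_pairings(site_length, mismatches=0):
    """Count total pairings for a bisPNA clamp: two per matched target position."""
    if not 0 <= mismatches <= site_length:
        raise ValueError("mismatches must be between 0 and site_length")
    return 2 * (site_length - mismatches)


print(bispna_base_pairings(8))                # fully matched 8 bp site
print(bispna_base_pairings(8, mismatches=1))  # same site with one mismatch
```

Under these assumptions a single mismatch removes two pairings at once, which is
consistent with the increased specificity noted above.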

Although not intending to be bound by any particular theory, the bisPNA
molecule is thought to bind to its target site first via its Hoogsteen strand, followed by
the invasion of the Watson-Crick strand to form a triplex with one of the original DNA
strands displaced. To facilitate the second step, the hybridization reaction is performed
at elevated temperature to increase the frequency of DNA helix opening (i.e., localized
melting). That mechanism increases the overall hybridization rate dramatically, since at
the moment of DNA opening, the Watson-Crick strand of bisPNA is positioned to invade
the helix.

Preferably, bisPNAs have homopyrimidine sequences, and even more preferably,
cytosines are protonated to form a Hoogsteen pair to a guanosine. Therefore, bisPNA
with thymines and cytosines is capable of effective hybridization to DNA only at pH
below 6.5. The first restriction - homopyrimidine sequence only - is inherent to the
mode of bisPNA binding. Pseudoisocytosine (J) can be used in the Hoogsteen strand
instead of cytosine to allow its hybridization through a broad pH range (Kuhn, H., J.
Mol. Biol. 286:1337-1345 (1999)).

BisPNAs have multiple modes of binding to nucleic acids (Hansen, G.I. et al., J.
Mol. Biol. 307(1):67-74 (2001)). One isomer includes two bisPNA molecules instead of
one. It is formed at higher bisPNA concentration and has a tendency to rearrange into the
complex with a single bisPNA molecule. Other isomers differ in positioning of the
linker around the target DNA strands. All the identified isomers still bind to the same
binding site/target.

Pseudocomplementary PNA (pcPNA) (Izvolsky, K.I. et al., Biochemistry 39:
10908-10913 (2000)) involves two single stranded PNAs added to dsDNA. One pcPNA
strand is complementary to the target sequence, while the other is complementary to the
displaced DNA strand. As the PNA/DNA duplex is more stable, the displaced DNA
generally does not restore the dsDNA structure. The PNA/PNA duplex is more stable
than the DNA/PNA duplex and the PNA components are self-complementary because
they are designed against complementary DNA sequences. Hence, the added PNAs
would rather hybridize to each other. To prevent the self-hybridization of pcPNA units,
modified bases are used for their synthesis including 2,6-diaminopurine (D) instead of
adenine and 2-thiouracil (sU) instead of thymine. While D and sU are still capable of
hybridization with T and A respectively, their self-hybridization is sterically prohibited.

This PNA construct also delivers two base pairs per every nucleotide of the target
sequence. Hence, it can bind to short sequences similar to those that are bisPNA targets.
The pcPNA strands are not connected by a hinge, and they have different sequences.
Hybridization of pcPNA can be less efficient than that of bisPNA because it
needs three molecules to form the complex. However, the pseudocomplementary strands
can be connected by a sufficiently long and flexible hinge.

Another bisPNA-based approach involves use of the displaced DNA strand
(Demidov, V.V. et al., Methods: A Companion to Methods in Enzymology 23(2):123-
131 (2001)). If the second bisPNA is hybridized close enough to the first one, then a run
of DNA (up to 25 bp) is displaced, forming an extended P-loop. This run is long enough
to be tagged. This combination is referred to as a PD-loop (Demidov, V.V. et al.,
Methods: A Companion to Methods in Enzymology 23(2):123-131 (2001)). Other
applications for the opening are also designed including topological labels or "earrings".
Tagging based on PD-loop has important advantages, including increased specificity.

In some embodiments, positive charges are incorporated into a probe such as a
PNA based probe in order to improve the interaction between the probe and the polymer.
Such modification increases the hybridization rate due to electrostatic attraction of the
positively charged probe and the negatively charged backbone of the nucleic acid
polymer.

Locked nucleic acid (LNA) molecules form hybrids with DNA, which are at least
as stable as PNA/DNA hybrids (Braasch, D.A. et al., Chem & Biol 8(1):1-7 (2001)).
Therefore, LNA can be used just as PNA molecules would be. LNA binding efficiency
can be increased in some embodiments by adding positive charges to it. LNAs have
been reported to have increased binding affinity inherently.

Commercial nucleic acid synthesizers and standard phosphoramidite chemistry
are used to make LNA oligomers. Therefore, production of mixed LNA/DNA sequences
is as simple as that of mixed PNA/peptide sequences. The stabilization effect of LNA
monomers is not an additive effect. The monomer influences the conformation of sugar
rings of neighboring deoxynucleotides, shifting them to more stable configurations
(Nielsen, P.E. et al., Peptide Nucleic Acids, Protocols and Applications, Norfolk:
Horizon Scientific Press, p. 1-19 (1999)). Also, a lesser number of LNA residues in the
sequence dramatically improves the accuracy of the synthesis. Naturally, most
biochemical approaches for nucleic acid conjugations are applicable to LNA/DNA
constructs.

The probes can also be stabilized in part by the use of other backbone
modifications. The invention intends to embrace, in addition to the peptide and locked
nucleic acids discussed herein, the use of the other backbone modifications such as but
not limited to phosphorothioate linkages, phosphodiester modified nucleic acids,
combinations of phosphodiester and phosphorothioate nucleic acid, methylphosphonate,
alkylphosphonates, phosphate esters, alkylphosphonothioates, phosphoramidates,
carbamates, carbonates, phosphate triesters, acetamidates, carboxymethyl esters,
methylphosphorothioate, phosphorodithioate, p-ethoxy, and combinations thereof.

Other backbone modifications, particularly those relating to PNAs, include
peptide and amino acid variations and modifications. Thus, the backbone constituents of
PNAs may be peptide linkages, or alternatively, they may be non-peptide linkages.
Examples include acetyl caps, amino spacers such as O-linkers, amino acids such as
lysine (particularly useful if positive charges are desired in the PNA), and the like.
Various PNA modifications are known and probes incorporating such modifications are
commercially available from sources such as Boston Probes, Inc.

One limitation of the stability of nucleic acid hybrids is the length of the probe,
with longer probes leading to greater stability than shorter probes. Notwithstanding this
proviso, the probes can be any length ranging from at least 4 nucleotides long to in
excess of 1000 nucleotides long. In preferred embodiments, the probes are 6-100
nucleotides in length, more preferably between 5-25 nucleotides in length, and even
more preferably 5-12 nucleotides in length. The length of the probe can be any length of
nucleotides between and including the ranges listed herein, as if each and every length
was explicitly recited herein. It should be understood that not all residues of the probe
need hybridize to complementary residues in the target nucleic acid molecule. For
example, the probe may be 50 residues in length, yet only 25 of those residues hybridize
to the nucleic acid polymer. Preferably, the residues that hybridize are contiguous with
each other.

The probes recognize and bind to sequences within the target polymer. If the
polymer is itself a nucleic acid molecule, then the probe preferably recognizes and binds
by hybridization to a complementary sequence within the target polymer. The specificity
of binding can be manipulated based on the hybridization conditions. For example, salt
concentration and temperature can be modulated in order to vary the range of sequences
recognized by the probes.
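
As a simplified illustration of sequence specific recognition, the following Python sketch
locates the sites on a single DNA strand to which a DNA probe could hybridize by perfect
Watson-Crick complementarity in antiparallel orientation. It ignores hybridization
conditions, mismatches, and Hoogsteen binding, and the sequences and function names are
hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return sequence.translate(COMPLEMENT)[::-1]


def binding_sites(target, probe):
    """Return start indices in target where the probe's complement occurs exactly."""
    site = reverse_complement(probe)
    return [
        i
        for i in range(len(target) - len(site) + 1)
        if target[i:i + len(site)] == site
    ]


target = "AAGCTTGGAATTC"  # hypothetical target strand, 5' to 3'
probe = "CCAAG"           # hypothetical probe, 5' to 3'
print(reverse_complement(probe))
print(binding_sites(target, probe))
```

Each index returned by such a search would correspond to one sequence specific site, and
therefore to one expected peak position in an oriented profile of the target.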

The probes are preferably single stranded, but they are not so limited. For
example, when the probe is a bisPNA it can adopt a secondary structure with the nucleic
acid polymer resulting in a triple helix conformation, with one region of the bisPNA
clamp forming Hoogsteen bonds with the backbone of the target polymer and another
region of the bisPNA clamp forming Watson-Crick bonds with the nucleotide bases of
the target polymer.

Polymer analysis according to the invention encompasses detecting signals
intrinsically present in a polymer or signals from an extrinsic probe that is bound to the
polymer. The signals in turn derive from labels or detectable moieties. The "label" or
"detectable moiety" may be, for example, light emitting, energy accepting, fluorescing,
radioactive, quenching, and the like, as the invention is not limited in this respect. Many
naturally occurring units of a polymer are light emitting compounds or quenchers, and
thus are intrinsically labeled. Both types of labels are useful according to the methods of
the invention. Guidelines for selecting the appropriate labels, and methods for adding
extrinsic labels to polymers are provided in more detail in US 6,355,420 B1.

The label or detectable moiety can be directly or indirectly detected. A directly
detectable moiety is one that can be detected directly by its ability to emit and/or absorb
light of a particular wavelength. An indirectly detectable moiety is one that can be
detected indirectly by its ability to bind, recruit and, in some cases, cleave another
moiety which itself may emit or absorb light of a particular wavelength. An example of
indirect detection is the use of a first enzyme label which cleaves a substrate into directly
detectable products. The label may be organic or inorganic in nature. For example, it
may be chemical, peptide or nucleic acid in nature although it is not so limited. Labels
can be conjugated to a polymer or probe using thiol, amino or carboxylic groups.

The labels described herein are referred to according to the systems by which
they are detected. As an example, a fluorophore molecule is a molecule that can be
detected using a system of detection that relies on fluorescence.

Generally, the label can be selected from the group consisting of an electron spin
resonance molecule (such as for example nitroxyl radicals), a fluorescent molecule (i.e.,
fluorophores), a chemiluminescent molecule (e.g., chemiluminescent substrates), a
radioisotope, an optical or electron density marker, an enzyme, an enzyme substrate, a
biotin molecule, a streptavidin molecule, an electrical charge transferring molecule (i.e.,
an electrical charge transducing molecule), a chromogenic substrate, a semiconductor
nanocrystal, a semiconductor nanoparticle, a colloid gold nanocrystal, a ligand, a
microbead, a magnetic bead, a paramagnetic particle, a quantum dot, a chromogenic
substrate, an affinity molecule, a protein, a peptide, nucleic acid, a carbohydrate, an
antigen, a hapten, an antibody, an antibody fragment, and a lipid. They are not so limited
however.

Examples of labels include fluorophores such as fluorescein (e.g., fluorescein
succinimidyl ester), TRITC, rhodamine, tetramethylrhodamine, R-phycoerythrin, Cy-3,
Cy-5, Cy-7, Texas Red, Phar-Red, allophycocyanin (APC); radioactive isotopes such as
P³² or H³; epitope or affinity molecules such as FLAG and HA epitope; and enzymes
such as alkaline phosphatase, horseradish peroxidase and β-galactosidase. Also
envisioned is the use of semiconductor nanocrystals such as quantum dots, described in
U.S. Pat. No. 6,207,392, as labels. Quantum dots are commercially available from
Quantum Dot Corporation. The labels may be directly linked to the DNA bases or may
be secondary or tertiary units linked to modified DNA bases.

Antibodies can be used according to the invention as probes as well as labels.
Thus, polymers can be labeled using antibodies or antibody fragments and optionally
their corresponding antigens, haptens or epitopes. In the latter embodiment, the antigen,
hapten, or epitope may itself be labeled. Detection of bound antibodies is accomplished
by techniques known to those skilled in the art. Antibodies bound to polymers can be
detected by linking a label to the antibodies and then observing the site of the label. If
antibody binding indicates sequence information, then the antibody should bind to the
polymer in a sequence specific manner. If antibody binding indicates merely the
presence of the polymer (e.g., represents the backbone of the polymer, as discussed
below), then the antibody need not bind to the polymer in a sequence specific manner.
In addition to the use of antigens, haptens and epitopes, antibodies can also be visualized
using secondary antibodies or fragments thereof that are specific for the primary
antibody. Polyclonal and monoclonal antibodies may be used. Antibody fragments
include Fab, F(ab)2, Fd and antibody fragments which include a CDR3 region.

In some embodiments, the polymer and/or probes are labeled with detectable
moieties that emit distinguishable signals that can all be detected by one type of
detection system. For example, the detectable moieties can all be fluorescent labels or
they can all be radioactive labels. In other embodiments, the polymers and/or probes are
labeled with moieties that are detected using different detection systems. For example,
one polymer or unit may be labeled with a fluorophore while another may be labeled
with a radioactive isotope.

In some instances, it may be desirable to further label the polymer with a standard
marker. The standard marker may be used to identify the polymer including defining,
but not distinguishing between, its ends. For example, the standard marker may be a
backbone label. One subset of backbone labels for nucleic acids are nucleic acid stains
that bind nucleic acids in a sequence independent manner. Examples include
intercalating dyes such as phenanthridines and acridines (e.g., ethidium bromide,
propidium iodide, hexidium iodide, dihydroethidium, ethidium homodimer-1 and -2,
ethidium monoazide, and ACMA); minor groove binders such as indoles and imidazoles
(e.g., Hoechst 33258, Hoechst 33342, Hoechst 34580 and DAPI); and miscellaneous
nucleic acid stains such as acridine orange (also capable of intercalating), 7-AAD,
actinomycin D, LDS751, and hydroxystilbamidine. All of the aforementioned nucleic
acid stains are commercially available from suppliers such as Molecular Probes, Inc.

Still other examples of nucleic acid stains include the following dyes from
Molecular Probes: cyanine dyes such as SYTOX Blue, SYTOX Green, SYTOX Orange,
POPO-1, POPO-3, YOYO-1, YOYO-3, TOTO-1, TOTO-3, JOJO-1, LOLO-1, BOBO-1,
BOBO-3, PO-PRO-1, PO-PRO-3, BO-PRO-1, BO-PRO-3, TO-PRO-1, TO-PRO-3,
TO-PRO-5, JO-PRO-1, LO-PRO-1, YO-PRO-1, YO-PRO-3, PicoGreen, OliGreen,
RiboGreen, SYBR Gold, SYBR Green I, SYBR Green II, SYBR DX, SYTO-40, -41,
-42, -43, -44, -45 (blue), SYTO-13, -16, -24, -21, -23, -12, -11, -20, -22, -15, -14, -25
(green), SYTO-81, -80, -82, -83, -84, -85 (orange), SYTO-64, -17, -59, -61, -62, -60, -63
(red).

In some instances, the detectable labels are part of a FRET system with
fluorescence signals dependent upon the proximal location of donor and acceptor
molecules. Preferably, fluorescence arises when donor and acceptor molecules are
proximally located to each other.
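
The proximity dependence of a FRET signal can be illustrated with the standard Förster relation, E = 1/(1 + (r/R0)^6). The sketch below is illustrative only: the function name and the 5 nm Förster radius are assumptions chosen for the example and are not values taken from this specification.

```python
def fret_efficiency(r_nm: float, r0_nm: float) -> float:
    """Forster transfer efficiency for a donor-acceptor separation r_nm.

    r0_nm is the Forster radius, i.e. the separation at which the
    efficiency is 50%.
    """
    if r_nm < 0 or r0_nm <= 0:
        raise ValueError("r_nm must be non-negative and r0_nm positive")
    return 1.0 / (1.0 + (r_nm / r0_nm) ** 6)


# Hypothetical donor/acceptor pair with an assumed 5 nm Forster radius.
for r in (2.5, 5.0, 10.0):
    print(f"r = {r:4.1f} nm -> E = {fret_efficiency(r, 5.0):.3f}")
```

The steep sixth-power fall-off is why a FRET signal is, in practice, only observed when donor and acceptor are close to each other.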

Length-proportional DNA labeling also can be performed using the Label IT® kit
which is commercially available from Mirus (Madison, WI). The kit covalently attaches
different fluorophores to DNA. The fluorophores are rhodamine, fluorescein, Cy3™ and
Cy5™.

The polymers are analyzed using polymer analysis systems. As a polymer is
analyzed, the detectable labels attached to it are detected in either a sequential or
simultaneous manner. A linear polymer analysis system is a system that analyzes
polymers in a sequential or linear manner (i.e., starting at one location on the polymer
and then proceeding linearly in either direction therefrom). When detected
simultaneously, the signals usually form an image of the polymer, from which distances
between labels can be determined. When detected sequentially, the signals are viewed in
a histogram (signal intensity vs. time) that can then be translated into a profile such as
those discussed herein, with knowledge of the velocity of the polymer. It is to be
understood that in some embodiments, the polymer is attached to a solid support, while
in others it is free flowing. In either case, the velocity of the polymer as it moves past,
for example, an interaction and/or detection station will aid in determining the position
of the labels, relative to each other and relative to other detectable markers that may be
present on the polymer.
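
The translation of a sequentially detected trace (intensity vs. time) into a positional profile can be sketched as follows. This is a minimal illustration that assumes a constant, known polymer velocity and a fixed sampling interval; the function names, the threshold and all numerical values are hypothetical.

```python
from typing import List, Sequence, Tuple


def trace_to_profile(samples: Sequence[float],
                     sample_interval_s: float,
                     velocity_um_per_s: float) -> List[Tuple[float, float]]:
    """Pair each intensity sample with its distance (um) from the first sample."""
    step_um = sample_interval_s * velocity_um_per_s
    return [(i * step_um, intensity) for i, intensity in enumerate(samples)]


def label_distance_um(profile: Sequence[Tuple[float, float]],
                      threshold: float) -> float:
    """Distance between the first and last samples at or above the threshold."""
    hits = [pos for pos, intensity in profile if intensity >= threshold]
    return hits[-1] - hits[0] if len(hits) >= 2 else 0.0


# Hypothetical trace: two label bursts, sampled every 1 ms at 1000 um/s.
trace = [0, 5, 0, 0, 6, 0]
profile = trace_to_profile(trace, sample_interval_s=0.001, velocity_um_per_s=1000.0)
print(label_distance_um(profile, threshold=4.0))
```

If the velocity is not constant, the same idea applies with a per-sample velocity estimate in place of the single constant used here.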

Accordingly, preferable polymer analysis systems are able to deduce not only the
total amount of label on a polymer, but perhaps more importantly, the location of such
labels. The ability to detect, position, and orient profiles allows these profiles to be
superimposed on other genetic maps, in order to orient and/or identify the regions of the
genome being analyzed, for example. In preferred embodiments, the linear polymer
analysis systems are capable of analyzing nucleic acid molecules individually (i.e., they
are single molecule detection systems).

An example of a suitable polymer analysis system is the Gene Engine™ system
described in PCT patent applications WO98/35012 and WO00/09757, published on
August 13, 1998, and February 24, 2000, respectively, and in issued U.S. Patent
6,355,420 B1, issued March 12, 2002. The contents of these applications and patent, as
well as those of other applications and patents, and references cited herein are
incorporated by reference in their entirety. This system allows single nucleic acid
molecules to be passed through an interaction station in a linear manner, whereby the
nucleotides in the nucleic acid polymer and/or the nucleic acid probe are interrogated
individually in order to determine whether there is a detectable label conjugated thereto.
Interrogation involves exposing the nucleic acid to an energy source such as optical
radiation of a set wavelength. In response to the energy source exposure, the detectable
label on the nucleotide (if one is present) emits a detectable signal. The mechanism for
signal emission and detection will depend on the type of label sought to be detected.

Other single molecule nucleic acid analytical methods which involve elongation
of DNA molecules can also be used in the methods of the invention. These include
optical mapping (Schwartz, D.C. et al., Science 262(5130):110-114 (1993); Meng, X. et
al., Nature Genet. 9(4):432-438 (1995); Jing, J. et al., Proc. Natl. Acad. Sci. USA
95(14):8046-8051 (1998); and Aston, C. et al., Trends Biotechnol. 17(7):297-302 (1999))
and fiber-fluorescence in situ hybridization (fiber-FISH) (Bensimon, A. et al., Science
265(5181):2096-2098 (1997)). In optical mapping, nucleic acids are elongated in a fluid
sample and fixed in the elongated conformation in a gel or on a surface. Restriction
digestions are then performed on the elongated and fixed nucleic acids. Ordered
restriction maps are then generated by determining the size of the restriction fragments.
In fiber-FISH, nucleic acids are elongated and fixed on a surface by molecular combing.
Hybridization with fluorescently labeled probe sequences allows determination of
sequence landmarks on the nucleic acids. Both methods require fixation of elongated
nucleic acids so that molecular lengths and/or distances between markers can be
measured. Pulse field gel electrophoresis can also be used to analyze the labeled nucleic
acids. Pulse field gel electrophoresis is described by Schwartz, D.C. et al., Cell
37(1):67-75 (1984). Other nucleic acid analysis systems are described by Otobe, K. et
al., Nucleic Acids Res. 29(22):E109 (2001), Bensimon, A. et al. in U.S. Patent 6,248,537,
issued June 19, 2001, Herrick, J. et al., Chromosome Res. 7(6):409-423 (1999), Schwartz
in U.S. Patent 6,150,089 issued November 21, 2000 and U.S. Patent 6,294,136, issued
September 25, 2001. Other linear polymer analysis systems can also be used, and the
invention is not intended to be limited to solely those listed herein.

The nature of such detection systems will depend upon the nature of the
detectable moiety attached to the polymer. The detection system can be selected from
any number of detection systems known in the art. These include an electron spin
resonance (ESR) detection system, a charge coupled device (CCD) detection system, an
avalanche photodiode (APD) detection system, a photomultiplier (PMT) detection
system, a fluorescent detection system, an electrical detection system, a photographic
film detection system, a chemiluminescent detection system, an enzyme detection
system, an atomic force microscopy (AFM) detection system, a scanning tunneling
microscopy (STM) detection system, an optical detection system, a nuclear magnetic
resonance (NMR) detection system, a near field detection system, and a total internal
reflection (TIR) detection system, many of which are electromagnetic detection systems.

Other interactions involved in methods of the invention will produce a nuclear
radiation signal. As a radiolabel on a polymer passes through the defined region of
detection, nuclear radiation is emitted, some of which will pass through the defined
region of radiation detection. A detector of nuclear radiation is placed in proximity of
the defined region of radiation detection to capture emitted radiation signals. Many
methods of measuring nuclear radiation are known in the art including cloud and bubble
chamber devices, constant current ion chambers, pulse counters, gas counters (i.e.,
Geiger-Müller counters), solid state detectors (surface barrier detectors, lithium-drifted
detectors, intrinsic germanium detectors), scintillation counters, Cerenkov detectors, to
name a few.

Other types of signals generated are well known in the art and have many
detection means which are known to those of skill in the art. Some of these include
opposing electrodes, magnetic resonance, and piezoelectric scanning tips. Opposing
nanoelectrodes can function by measurement of capacitance changes. Two opposing
electrodes create an area of energy storage, located effectively between the two
electrodes. It is known that the capacitance of such a device changes when different
materials are placed between the electrodes. The dielectric constant is a value associated
with the amount of energy a particular material can store (i.e., its capacitance). Changes
in the dielectric constant can be measured as a change in the voltage across the two
electrodes. In the present example, different nucleotide bases or unit specific markers of
a polymer may give rise to different dielectric constants. The capacitance changes with the
dielectric constant of the unit specific marker of the polymer according to the equation C = KC0,
where K is the dielectric constant and C0 is the capacitance in the absence of any bases.
The voltage deflection of the nanoelectrodes is then outputted to a measuring device,
recording changes in the signal with time.
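
The relation C = KC0 and the resulting voltage change can be illustrated numerically. The sketch below assumes, for simplicity, that the charge on the electrodes is held fixed so that V = Q/C; this fixed-charge assumption, the function names and all numerical values (including the dielectric constants) are hypothetical and are not taken from this specification.

```python
def capacitance(k: float, c0: float) -> float:
    """C = K * C0: capacitance with a material of dielectric constant k in the gap."""
    return k * c0


def voltage_at_fixed_charge(charge: float, k: float, c0: float) -> float:
    """V = Q / C for a fixed electrode charge (an assumption of this sketch)."""
    return charge / capacitance(k, c0)


C0 = 1.0e-18      # farads, hypothetical empty-gap capacitance
Q = 1.6e-19       # coulombs, hypothetical fixed charge

# Hypothetical dielectric constants for two different unit specific markers.
for name, k in (("marker_1", 2.0), ("marker_2", 4.0)):
    v = voltage_at_fixed_charge(Q, k, C0)
    print(f"{name}: C = {capacitance(k, C0):.2e} F, V = {v:.3f} V")
```

Under these assumptions, doubling the dielectric constant halves the voltage, which is the kind of deflection that would be recorded over time as units pass between the electrodes.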

Detectable signals are generated, detected and stored in a database. The signals
can be analyzed to determine structural information about the polymer. The signals can
be analyzed by assessing the intensity of the signal to determine structural information
about the polymer. A computer may be used to store the database and/or perform the
algorithms described herein. The computer may be the same computer used to collect
data about the polymers, or may be a separate computer dedicated to data analysis. A
suitable computer system to implement embodiments of the present invention typically
includes an output device which displays information to a user, a main unit connected to
the output device and an input device which receives input from a user. The main unit
generally includes a processor connected to a memory system via an interconnection
mechanism. The input device and output device also are connected to the processor and
memory system via the interconnection mechanism. Computer programs for data
analysis of the detected signals are readily available from CCD (Charge Coupled Device)
manufacturers.

The present invention is further illustrated by the following Examples, which in
no way should be construed as further limiting. The entire contents of all of the
references (including literature references, issued patents, published patent applications,
and co-pending patent applications) cited throughout this application are hereby
expressly incorporated by reference.

Examples 

Example 1: Mapping of 12M9 BAC

A bacterial artificial chromosome, 12M9 BAC, was mapped using bisPNA #K tag:

TMR-OO-Lys-Lys-TTC TTC TC-OOO-JTJ-TTJ-TT-Lys-Lys
The BAC has target sites at 0.7, 10.2, 73.5, 152.4, 154.5, and 181.4 kb (Fig. 1, top
panel). There are 82 sites with a single mismatch at the ends (SEMM) on this DNA.
SEMM are sites to which sequence specific probes can bind even though they are not
100% complementary. These can be estimated from a known nucleic acid sequence. To
obtain the map, all polymer traces were aligned using the center of molecule reference
point (CM) and averaged (Fig. 2, top panel). This image represents the averaged signals
from each bound probe on the overlaid but non-oriented DNA polymers. Although
other internal reference points can be used, CM is particularly useful as the reference
point when the polymer is incompletely stretched. These polymers are not stretched
homogeneously but rather can take a stem and flower conformation (Manneville et al.,
Europhys. Lett. 36:413-418, 1996). In this conformation, most fluctuations are
concentrated in the termini regions (i.e., the flower region), while the middle of the DNA
polymer remains highly stretched (i.e., the stem region). Therefore, even in incompletely
stretched polymers the central portion is usable for analysis. The measured non-oriented
map (continuous line) is overlaid in Fig. 2, top panel, with the expected (i.e.,
theoretical) map from the published 12M9 sequence (dashed line). The latter was
obtained by representing the sequence on a 0.34 µm/kb scale, including the target signals
as Gaussian curves 5 kb in width (Fig. 1, middle panel), and superimposing the map with
its mirror image (Fig. 1, bottom panel). All the peaks expected for 12M9 BAC
(designated A, B and C) are present in the experimental map.
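
The construction of the expected non-oriented map (Gaussian target signals superimposed with their mirror image) can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the polymer length (185 kb) is a hypothetical placeholder because the length of the BAC is not given here, the 5 kb width is interpreted as a full width at half maximum, and positions are kept in kb (multiplying by 0.34 converts them to µm on the scale mentioned above).

```python
import math
from typing import List, Sequence

SITES_KB = [0.7, 10.2, 73.5, 152.4, 154.5, 181.4]  # target sites from Example 1
LENGTH_KB = 185.0                                  # hypothetical polymer length


def gaussian_map(sites_kb: Sequence[float], length_kb: float,
                 width_kb: float = 5.0, step_kb: float = 1.0) -> List[float]:
    """Oriented theoretical map: one unit-height Gaussian per target site."""
    # Assumption: width_kb is the full width at half maximum of each Gaussian.
    sigma = width_kb / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    n_points = int(round(length_kb / step_kb)) + 1
    xs = [i * step_kb for i in range(n_points)]
    return [sum(math.exp(-0.5 * ((x - s) / sigma) ** 2) for s in sites_kb)
            for x in xs]


def non_oriented(profile: Sequence[float]) -> List[float]:
    """Superimpose a profile with its mirror image about the center of molecule."""
    return [a + b for a, b in zip(profile, reversed(profile))]


oriented_theory = gaussian_map(SITES_KB, LENGTH_KB)
non_oriented_theory = non_oriented(oriented_theory)
print(len(non_oriented_theory), max(non_oriented_theory))
```

By construction the non-oriented map is symmetric about its midpoint, so each target site away from the center produces a pair of mirror-image peaks.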

One extra peak, designated D, was also present. It was hypothesized that this
peak represents at least one extra target site, missed in the BAC and human genome
sequence due to a sequencing error. To verify this, this region of the BAC was
re-sequenced. It was found that beginning at position 64,388 bp, three SEMM sites are
positioned in close proximity separated by a single base pair. Such close proximity
alleviates the energetic cost of displacement of the second DNA strand and allows
formation of stable complexes even with mismatching sites (a so-called P-loop
structure). Because this complex is highly cooperative and can exist only if 2 or 3
SEMM sites are hybridized simultaneously, peak D is much higher than peak A which is
formed by a single target.

A symmetric pattern of the sharp peaks is clearly visible on top of a featureless
pedestal at an intensity of about 4-5 (Fig. 2, top panel). For comparison, the signals
derived from DNA-bound impurities, as measured on untagged BACs, are presented on the
same plot (dot and dash curve). The major portion (if not all) of the measured map
pedestal is due to signals from these impurities. The S/N ratio is very high for the
mapping procedure itself.

To obtain the map profile that includes only DNA molecules oriented in one
direction, we extracted the profiles that contributed to peaks A' and B (Fig. 2, top panel).
Those peaks are formed by tags hybridized at positions 73.5 kb and 152.0/154.5 kb,
respectively, and only by molecules oriented in the same direction. The signals of these
selected profiles were summed and averaged. The resulting oriented profile is presented
in Fig. 2, bottom panel and compared with the theoretically expected profile (dashed
line).
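
The selection-and-averaging step described above can be sketched in a few lines. The code is a minimal illustration and not the software actually used: it assumes traces are binned intensity lists, that center-of-molecule alignment can be approximated by symmetric zero padding, and that a trace "contributes" to a peak when its maximum within the peak window reaches a threshold. All names and values are hypothetical.

```python
from typing import List, Optional, Sequence


def center_align(trace: Sequence[float], width: int) -> List[float]:
    """Zero-pad a trace so its midpoint sits at the midpoint of `width` bins."""
    pad = width - len(trace)
    if pad < 0:
        raise ValueError("width must be at least the trace length")
    left = pad // 2
    return [0.0] * left + [float(v) for v in trace] + [0.0] * (pad - left)


def average(traces: Sequence[Sequence[float]]) -> List[float]:
    """Bin-by-bin mean of equally long, aligned traces."""
    return [sum(column) / len(traces) for column in zip(*traces)]


def peak_profile(traces: Sequence[Sequence[float]], lo: int, hi: int,
                 threshold: float) -> Optional[List[float]]:
    """Average only the traces whose signal in bins [lo, hi) reaches threshold."""
    selected = [t for t in traces if max(t[lo:hi]) >= threshold]
    return average(selected) if selected else None


# Toy data: the same polymer read head-first and tail-first.
head_first = [0, 9, 0, 0, 4, 0, 0]
tail_first = head_first[::-1]
traces = [center_align(t, 7) for t in (head_first, tail_first, head_first, tail_first)]

population = average(traces)                              # non-oriented map
oriented = peak_profile(traces, lo=1, hi=2, threshold=5.0)
print(population)
print(oriented)
```

In the toy data the peak in bin 1 is produced only by head-first traces, so averaging the traces that contribute to it recovers the oriented profile.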

Example 2: Algorithms.

In the general case, every peak of a non-oriented map can be tested, selecting the
molecular traces contributing to it. If the peak includes tag signals from both polymer
orientations, the selected map resembles the total non-oriented map. If the peak is
formed only by signals from DNA polymers of one orientation, the selected map does
not include all the peaks of the total non-oriented map. Moreover, it can be inverted and
combined with itself to produce the total non-oriented map.
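
The peak test described in this paragraph can be sketched as a small classifier. This is a hedged illustration only: the tolerance-based notion of "resembles", the averaging used to combine a profile with its inverse, and the returned labels are assumptions introduced for the example.

```python
from typing import List, Sequence


def mirror_combine(profile: Sequence[float]) -> List[float]:
    """Average a profile with its own mirror image (inverted copy)."""
    return [(a + b) / 2.0 for a, b in zip(profile, reversed(profile))]


def resembles(p: Sequence[float], q: Sequence[float], tol: float) -> bool:
    """True if two profiles agree bin by bin within an absolute tolerance."""
    return len(p) == len(q) and all(abs(a - b) <= tol for a, b in zip(p, q))


def classify_peak_profile(peak_prof: Sequence[float],
                          population: Sequence[float],
                          tol: float = 0.5) -> str:
    """Classify the map selected through one peak of the non-oriented map."""
    if resembles(peak_prof, population, tol):
        return "non-oriented"      # peak drew traces from both orientations
    if resembles(mirror_combine(peak_prof), population, tol):
        return "oriented"          # inverted + combined reproduces the total map
    return "undetermined"          # e.g. a mixture of different polymers


population = [0.0, 4.5, 2.0, 0.0, 2.0, 4.5, 0.0]
selected = [0.0, 9.0, 0.0, 0.0, 4.0, 0.0, 0.0]
print(classify_peak_profile(selected, population))
```

With real data the tolerance would have to reflect noise and background, and a correlation-based similarity measure may be more robust than a fixed per-bin tolerance.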

A similar approach can be applied to discriminate between two polymers of similar
but not identical length. In this case, the total non-oriented map is searched for
peaks which are formed by input of tags from one polymer only. The criterion for peak
selection is that the map formed by the polymers contributing to the peak does not have all
peaks present in the total non-oriented map. Once the single polymer map is obtained,
its oriented map can be further determined as described herein. In the case of more than
two overlapping polymers, the same approach can be used to break the total non-oriented
map into simpler combinations. It should be possible directly (if every polymer has a
representative peak in the total non-oriented map), or step by step by subtracting single
polymer maps from the total map and re-iterating the process thereby simplifying the
map with each iteration.
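
One iteration of the subtract-and-repeat strategy can be sketched as follows. The example is illustrative only: the simple local-maximum peak finder, the clipping of negative residuals to zero, the weight parameter and the toy profiles are all assumptions made for the sketch.

```python
from typing import List, Sequence


def subtract(total: Sequence[float], component: Sequence[float],
             weight: float = 1.0) -> List[float]:
    """Subtract a single-polymer map from the total map, clipping at zero."""
    return [max(t - weight * c, 0.0) for t, c in zip(total, component)]


def find_peaks(profile: Sequence[float], threshold: float) -> List[int]:
    """Indices of interior local maxima at or above the threshold."""
    return [i for i in range(1, len(profile) - 1)
            if profile[i] >= threshold
            and profile[i] > profile[i - 1]
            and profile[i] >= profile[i + 1]]


# Toy total map containing two polymers; the first has already been isolated.
total_map = [0.0, 3.0, 0.0, 5.0, 0.0, 3.0, 0.0, 2.0, 0.0]
polymer_1 = [0.0, 3.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0]

residual = subtract(total_map, polymer_1)
print(find_peaks(total_map, threshold=1.0))   # peaks before subtraction
print(find_peaks(residual, threshold=1.0))    # peaks left for the next iteration
```

Peaks remaining in the residual belong to polymers not yet accounted for, and the same selection step can be applied to them in the next iteration.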

The power of this approach is that it is based on the averaged pattern and
therefore is not sensitive to stretching and tagging defects of a particular detected
polymer profile. Most important are the mapping resolution and degree of labeling of
every fragment. Better resolution increases the probability of isolating the peak with
input from one polymer only. A higher degree of labeling improves detection of all
other peaks belonging to the polymer. To some extent, incomplete labeling can be
compensated by including more polymer traces in the averaging.
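
The trade-off between labeling efficiency and the number of averaged traces can be made concrete with elementary binomial statistics. The sketch assumes, purely for illustration, that each target site on each trace is labeled and detected independently with the same probability p; the function names and the example efficiencies are hypothetical.

```python
import math


def prob_detected_at_least_once(p: float, n_traces: int) -> float:
    """Probability that a given site is seen in at least one of n_traces."""
    return 1.0 - (1.0 - p) ** n_traces


def occupancy_standard_error(p: float, n_traces: int) -> float:
    """Standard error of the averaged site occupancy over n_traces."""
    return math.sqrt(p * (1.0 - p) / n_traces)


for p in (0.5, 0.95):
    for n in (10, 100, 1000):
        print(f"p = {p:.2f}, N = {n:4d}: "
              f"seen at least once = {prob_detected_at_least_once(p, n):.4f}, "
              f"SE of average = {occupancy_standard_error(p, n):.4f}")
```

Under this simple model the uncertainty of an averaged peak height falls as one over the square root of the number of traces, which is why a lower degree of labeling can be offset by averaging more traces.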

Selectivity can be further improved by simultaneously using several tags for
different targets emitting in different spectral regions. In this case, the tags are detected
independently. If some DNA polymers from the mixture can be identified using a map
obtained with one of the tags, this identification can be applied to the maps obtained with
all other tags. Similarly, if the tagging pattern is asymmetric for one of the tags, it can be
used to orient the maps of this fragment obtained with all other tags. Application of
different tags not only improves selectivity but also offers extra strategies for analysis.
For example, one of the tags can be selected on the basis that it binds rarely and to be
used for recognition of the fragments and orienting of their detected traces. In a
complementary manner, another tag can be selected with a high density of target sites on
the DNA polymer to provide higher resolution mapping.
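
The idea of orienting one color channel using an asymmetric pattern in another can be sketched as follows. This is an assumption-laden illustration: each trace is taken to consist of two equally binned channels recorded simultaneously, and orientation is decided by comparing the rare-tag channel with a reference pattern using a simple dot product. The names and data are hypothetical.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equally long sequences."""
    return sum(x * y for x, y in zip(a, b))


def orient_by_reference_channel(
        rare: Sequence[float], dense: Sequence[float],
        rare_reference: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Flip both channels if the reversed rare channel matches the reference better."""
    rare_list, dense_list = list(rare), list(dense)
    if dot(rare_list[::-1], rare_reference) > dot(rare_list, rare_reference):
        return rare_list[::-1], dense_list[::-1]
    return rare_list, dense_list


reference = [0, 1, 0, 0, 0]          # asymmetric pattern of the rare tag
rare_trace = [0, 0, 0, 1, 0]         # same polymer, detected tail-first
dense_trace = [1, 2, 3, 4, 5]        # second color channel of the same trace

print(orient_by_reference_channel(rare_trace, dense_trace, reference))
```

Because both channels come from the same molecule, the orientation decided on the sparse channel carries over directly to the densely tagged one.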
Algorithms containing such data processing steps can be provided in a software
package. The software package will be capable of analyzing data from different color
channels to automatically perform selection, orientation and averaging. The detection
system used to generate the data can be outfitted with third color excitation and detection
channels.

Example 3: Genomic Sequencing and Pathogen Analysis.

This example describes two different types of analysis. In the first case, prior
knowledge about the analyzed DNA samples is available. One example is mapping of
BAC libraries generated for genomic analysis. For known genomes, rare-cutting
restriction endonucleases can be used to generate fragments of different sizes. Moreover,
BACs or small-size genomes can be analyzed as single molecules. A second example is
the mapping of different strains of the same microorganism. In the latter application, a
previously unsequenced genome of a known or unknown microbe can be analyzed. The
major difference from the previous case is that restriction enzyme treatment results in an
unknown distribution of sizes. However, even the use of rare cutters does not always
guarantee a distribution of fragment sizes appropriate for linear polymer analysis, using
systems such as the GeneEngine. To facilitate this analysis, several digests may be
required.

One application of GeneEngine mapping with unknown genomes is restriction
mapping. The major difference from the standard approach based on electrophoresis (for
example see Brown, "Genomes." New York: John Wiley & Sons Inc. (1999) 472 p.) is
that in addition to its size, every fragment can be characterized by a pattern of bound
tags. This allows the generation of a species specific "barcode" which can be used, for
example, for strain recognition. This will be useful for example in the fields of infection
outbreaks in human and agricultural subjects, germ warfare, and the like.
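
Matching a fragment "barcode" (size plus tag pattern) against reference strains can be sketched as below. This is a hypothetical illustration: the library layout, the strain names, the tolerances and the count-based score are all assumptions made for the example, not a described implementation.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical reference library: strain -> (fragment size in kb, tag sites in kb).
Library = Dict[str, Tuple[float, List[float]]]


def match_score(observed_sites: Sequence[float],
                reference_sites: Sequence[float], tol_kb: float) -> int:
    """Number of observed tag sites lying within tol_kb of some reference site."""
    return sum(1 for o in observed_sites
               if any(abs(o - r) <= tol_kb for r in reference_sites))


def best_strain(observed_size: float, observed_sites: Sequence[float],
                library: Library, size_tol_kb: float = 5.0,
                site_tol_kb: float = 2.0) -> Optional[str]:
    """Best-scoring strain among those whose fragment size is compatible."""
    best, best_score = None, -1
    for name, (size, sites) in library.items():
        if abs(size - observed_size) > size_tol_kb:
            continue  # size alone rules this strain out
        score = match_score(observed_sites, sites, site_tol_kb)
        if score > best_score:
            best, best_score = name, score
    return best


library: Library = {
    "strain_A": (100.0, [10.0, 40.0, 90.0]),
    "strain_B": (100.0, [10.0, 55.0, 70.0]),
}
print(best_strain(101.0, [10.5, 54.0, 71.0], library))
```

Two fragments of the same size, which electrophoresis alone could not distinguish, are separated here by their tag patterns.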

Equivalents 

The foregoing written specification is considered to be sufficient to enable one
skilled in the art to practice the invention. The present invention is not to be limited in
scope by examples provided, since the examples are intended as a single illustration of
one aspect of the invention and other functionally equivalent embodiments are within the
scope of the invention. Various modifications of the invention in addition to those
shown and described herein will become apparent to those skilled in the art from the
foregoing description and fall within the scope of the appended claims. The advantages
and objects of the invention are not necessarily encompassed by each embodiment of the
invention.

What is claimed is: 
1. A method for analyzing polymer intensity data from a sample comprising
obtaining intensity profiles from individual labeled polymers contained in the
sample,
aligning individual intensity profiles from individual labeled polymers with
respect to an alignment reference point,
combining aligned individual intensity profiles to generate a population profile,
selecting a peak in the population profile and obtaining individual intensity
profiles that contribute to the peak,
combining individual intensity profiles that contribute to the peak to generate a
peak profile, and
comparing the peak profile with the population profile.

2. The method of claim 1, wherein the sample contains a heterogeneous
mixture of polymers.

3. The method of claim 2, wherein the heterogeneous mixture of polymers
comprises differentially sized fragments of a parent polymer.

4. The method of claim 2, wherein the heterogeneous mixture of polymers
comprises polymers with different sequences.

5. The method of claim 1, wherein the profiles are intensity versus length
profiles.

6. The method of claim 1, wherein the intensity data is fluorescence
intensity data and intensity profiles are fluorescence intensity profiles.

7. The method of claim 1, wherein the polymers are labeled with a sequence
specific probe.

8. The method of claim 1, wherein the polymers are labeled with a sequence
non-specific label.

9. The method of claim 1, wherein the method is implemented on a
computer.

10. The method of claim 1, wherein the polymer is a nucleic acid.

11. The method of claim 10, wherein the nucleic acid is DNA or RNA.

12. The method of claim 11, wherein the DNA is genomic nuclear DNA,
mitochondrial DNA or cDNA.

13. The method of claim 11, wherein the RNA is mRNA.

14. The method of claim 1, wherein the alignment reference point is an
internal reference point.

15. The method of claim 14, wherein the alignment reference point is a center
of molecule reference point.

16. The method of claim 14, wherein the alignment reference point is a
sequence specific probe bound to individual polymers.

17. The method of claim 14, wherein the alignment reference point is a
sequence non-specific probe bound to individual polymers.

18. The method of claim 1, wherein the intensity profiles are obtained from
individual polymers in flow.

19. The method of claim 1, wherein the intensity profiles are obtained from
individual polymers fixed to a solid support.

20. The method of claim 1, wherein the population profile is a cumulative
population profile.

21. The method of claim 1, wherein the population profile is an averaged
population profile.

22. The method of claim 1, wherein the peak profile is a cumulative peak
profile.

23. The method of claim 1, wherein the peak profile is an averaged peak
profile.

24. The method of claim 1, wherein the peak is selected based on intensity.

25. The method of claim 1, wherein the peak is selected based on the
presence of its mirror image peak in the population profile.



26. The method of claim 1, wherein polymers in the sample are sorted
according to size prior to aligning individual intensity profiles.

27. The method of claim 1, wherein a peak profile that resembles the
population profile indicates a non-oriented profile.

28. The method of claim 1, wherein a peak profile that consists of a subset of
peaks from the population profile indicates a putative oriented profile.

29. The method of claim 28, further comprising inverting the putative
oriented profile to generate a putative inverted profile, combining the putative oriented
profile with the putative inverted profile to generate a putative non-oriented profile, and
comparing the putative non-oriented profile with the population profile, wherein a
putative non-oriented profile that is identical to the population profile indicates that the
putative oriented profile is an oriented profile, that the putative inverted profile is an
inverted profile, and that the putative non-oriented profile is a non-oriented profile.

30. The method of claim 28, further comprising determining whether
individual peaks in the peak profile have corresponding mirror image peaks in the
population profile when the alignment reference point is a center of molecule reference
point.

31. The method of claim 30, wherein the presence of corresponding mirror
images indicates the putative oriented profile is an oriented profile.

32. The method of claim 28, further comprising determining whether the
oriented peak has a corresponding mirror image peak in the population profile when the
alignment reference point is a center of molecule reference point.

33. The method of claim 32, further comprising obtaining individual intensity
profiles that contribute to the mirror image peak, and combining individual intensity
profiles that contribute to the mirror image peak to generate a mirror image peak profile.

34. The method of claim 33, further comprising comparing the mirror image
peak profile with the population profile.

35. The method of claim 34, further comprising determining whether the
mirror image peak profile is a mirror image of the peak profile.

36. The method of claim 35, further comprising inverting and combining the
mirror image peak profile with the peak profile provided the mirror image peak profile is
a mirror image of the peak profile.

37. The method of claim 33, wherein the mirror image peak profile is a
cumulative mirror image peak profile.

38. The method of claim 33, wherein the mirror image peak profile is an
averaged mirror image peak profile.

39. The method of claim 28 or 31, further comprising subtracting the peak
profile from the population profile.

40. The method of claim 35, further comprising subtracting the mirror image
peak profile from the population profile.

41. The method of claim 35, further comprising subtracting the peak profile
and the mirror image peak profile from the population profile.

42. The method of claim 41, further comprising determining whether
additional peaks remain in the population profile following subtraction of the peak
profile and the mirror image peak profile.

43. The method of claim 42, wherein the presence of additional peaks is
indicative that the sample comprised different polymers.

44. The method of claim 1, wherein the polymer is completely stretched.

45. The method of claim 1, wherein the polymer is partially stretched.

46. The method of claim 31, further comprising inverting the oriented profile,
combining the oriented profile with the inverted profile to generate a non-oriented
profile, and comparing the non-oriented profile with the population profile.

47. The method of claim 8, wherein the sequence non-specific label is a
backbone label.

48. The method of claim 47, wherein the alignment reference point is a center
of molecule reference point.

49. The method of claim 48, wherein the center of molecule reference point is
the midpoint of an individual profile.

50. The method of claim 1, wherein the peak is visible in an intensity versus
length profile.

51. The method of claim 1, wherein the peak corresponds to bin counts.

52. The method of claim 1, wherein the polymer is uniformly stretched.

53. The method of claim 1, wherein the sample comprises polymers
embedded in a gel matrix.